Skip to content

test+fix(orchestration): rip maxTurns; restore answer-the-user's-question behavior (#598 follow-up)#599

Open
wyuc wants to merge 6 commits into
mainfrom
feat/598-eval-orchestration
Open

test+fix(orchestration): rip maxTurns; restore answer-the-user's-question behavior (#598 follow-up)#599
wyuc wants to merge 6 commits into
mainfrom
feat/598-eval-orchestration

Conversation

@wyuc
Copy link
Copy Markdown
Contributor

@wyuc wyuc commented May 26, 2026

Closes #598 follow-up items 1 and 2 (and surfaces additional in-product fixes discovered during the prod investigation).

The root symptom this PR addresses: agents drift onto adjacent topics rather than answering the user's literal question ("答非所问"). The director then keeps picking peer agents for "variety" instead of routing back to the teacher to actually answer. Frustration signals from the user are just one sub-case where this is loud enough to notice; the underlying bug fires earlier, often on the first response.

What's in this PR

Commit 1: premature-END regression guard

5 scenarios modelled on #511's described symptoms. A/B across 9 models × 5 scenarios × 5 samples per variant: 0% END rate in both variants on every scenario. The prompt-layer rules added in #554 don't measurably change director END behavior on these inputs. Kept as a regression guard so any future change that raises the END rate trips this eval.

Commit 2: director question-answering eval

5 synthesized scenarios — 3 first-turn drift (no frustration yet), 2 with escalated frustration. Decision rule-judged: USER | TEACHER | OTHER_AGENT | END. A/B: current main vs main with rule 13 stripped.

Commit 3: director rule 13 — Answering the User's Question

Constrained-output rule injected at the end of the # Rules section. Triggers whenever the conversation has an unanswered user question. Constraints: route to LITERAL role: teacher by default, USER cue when ambiguous or after multiple unresolved complaints, forbid peer-agent "variety" routing and END while the question is open.

On google:gemini-3-flash-preview, 5 samples per variant:

Scenario baseline with_rule
math_quadratic_axis_drift_no_frustration 100% 100%
bio_dark_reaction_drift_no_frustration 80% 100%
english_team_isare_drift_no_frustration 20% 100%
physics_inertial_mass_escalated_frustration 0% 100%
calculus_product_rule_drift_no_frustration 60% 100%
mean 52% 100%

3 of 5 scenarios reproduce the bug at baseline (≤60% correct, peer-agent "variety" routing dominates). The rule lifts every scenario to 100%.

Commit 4: agent rule — Answering the User's Question

Adds a section to the agent system prompt that directs the agent, on every response where the user message contains a question, to lead with the concrete answer to the literal question before any other content. Pairs with the director rule — director routes to the teacher when the question is unanswered; the teacher now has explicit guidance for what to do when dispatched.

Commit 5: rip maxTurns end-to-end + distinguish errors from completion

maxTurns rip: the director graph used maxTurns: turnCount + 1 to force "one director→agent cycle per request" — a hack expressing the single-round contract as a turn limit. Replaced with explicit graph topology (agent_generate → END). The user-facing "Max Discussion Turns" setting is also removed: with rule-driven routing (cue_user / END), a numeric cap is redundant and was previously masking director bugs by force-ending.

Error vs completion: before, every non-cue_user outcome of the client agent loop mapped to status: 'completed', conflating end (legitimate completion) with empty_turns (parser failure / model timeout) and no_done (SSE infra failure). The new switch routes the error paths through clearLiveSessionAfterError, which appends an inline System message and sets the new status: 'error'. The session list gets a red AlertCircle icon for the error state.

Touches: director-graph.ts, lib/types/chat.ts, lib/store/settings.ts, components/agent/agent-bar.tsx, components/settings/agent-settings.tsx, lib/chat/agent-loop.ts, components/chat/use-chat-sessions.ts, components/chat/session-list.tsx, eval/whiteboard-layout/runner.ts, 6 i18n locales.

Test plan

  • pnpm check clean
  • pnpm test — no new failures (9 pre-existing ssrf-guard DNS-bound failures unrelated)
  • npx tsc --noEmit -p tsconfig.json clean
  • EVAL_DIRECTOR_MODEL=<model> pnpm eval:orchestration — premature-END eval passes (regression guard)
  • EVAL_DIRECTOR_MODEL=<model> npx tsx --env-file=.env.local eval/orchestration/answering-runner.ts — 5/5 scenarios PASS, mean 100% correct rate with rule, baseline 52%
  • dev server sanity (manual) — multi-agent discussion, user asks specific question → check teacher answers directly; agent failure to produce output → check session-list error icon

@wyuc wyuc changed the title test(orchestration): premature-END regression eval (follow-up to #554) test(orchestration): probe #511 root causes — premature-END (refuted) + frustration handling (reproduced) May 26, 2026
@wyuc wyuc force-pushed the feat/598-eval-orchestration branch from 6cd1160 to ca403bc Compare May 27, 2026 02:47
@wyuc wyuc changed the title test(orchestration): probe #511 root causes — premature-END (refuted) + frustration handling (reproduced) test(orchestration): premature-END regression eval + frustration-handling eval (#598 follow-up) May 27, 2026
@wyuc wyuc force-pushed the feat/598-eval-orchestration branch from 135f659 to f5950ca Compare May 27, 2026 09:21
@wyuc wyuc changed the title test(orchestration): premature-END regression eval + frustration-handling eval (#598 follow-up) test+fix(orchestration): rip maxTurns; restore answer-the-user's-question behavior (#598 follow-up) May 27, 2026
@wyuc wyuc force-pushed the feat/598-eval-orchestration branch from f5950ca to ca4ad98 Compare May 27, 2026 09:31
wyuc added 5 commits May 27, 2026 05:31
Adds eval/orchestration/ following the outline-language pattern: a runner
that A/B-tests the director against the same scenario with two prompt
variants — current main (post-#554) and a synthesised pre-#554 baseline
that strips the role-aware summary labels AND the new system.md rules
10/11/12 together.

Pass criterion is now framed as a regression guard rather than a fixture
discrimination test: every scenario's post-fix END rate must stay below
EVAL_END_THRESHOLD (default 20%). The pre-vs-post Δ is reported as
informational data, since #554's reviewer feedback was that earlier
fixtures didn't discriminate.

Empirical finding across 7 shipped model configs (gpt-4.1-mini,
gpt-4o-mini, gpt-5.4-nano, qwen-plus, qwen3.5-flash, deepseek-chat,
deepseek-v4-flash, gemini-2.5-flash, claude-haiku-4-5) with 5 scenarios
modelled on #511 (incl. the exact Tiananmen-3D objection trace, soft
pushback after a long resolved-looking discussion, topic pivot, brief
acknowledgement, and explicit teacher-signals-end-then-user-objects):
no scenario produced a non-zero END rate in either variant. Useful
data for the #554/#598 discussion — the prompt-layer rules don't
measurably change behavior on these shipped models with these prompts,
but the eval is now wired so future regressions show up.
…w-up)

Adds an A/B eval probing whether the director routes correctly when the
conversation contains an unanswered user question — both in the first-turn
drift case (agents started answering but drifted onto adjacent topics,
no frustration yet) and in the escalated frustration case (user has
already complained multiple times).

The prod symptom this targets ("答非所问"): agents reply on adjacent
topics rather than answering the literal question, and the director
keeps picking peer agents for "variety" instead of routing back to
the teacher to actually answer.

Scenarios: 5 synthesized cases across math, biology, English grammar,
physics, and calculus. 3 are first-turn drift (no frustration yet),
2 include frustration signals. Each captures: user asks specific
question, agents reply on adjacent topics, director is asked to pick
the next move.

Decision rule-judged into USER | TEACHER | OTHER_AGENT | END:
  - USER       → ✓ valid (cue user to clarify)
  - TEACHER    → ✓ valid (re-route to teacher to answer)
  - OTHER_AGENT → ✗ wrong (peer-agent "variety" routing — the bug)
  - END        → ✗ wrong

A/B:
  - baseline  : current director template with rule 13 stripped
  - with_rule : current director template as-shipped

Pass = with_rule.correctRate ≥ EVAL_PASS_THRESHOLD (default 0.7).
Pre-vs-post Δ is reported as informational only.

Note this commit only adds the eval; the rule it strips/restores in
the A/B is injected in the next commit.
Adds rule 13 to lib/prompts/templates/director/system.md so the director
recognizes when the conversation contains an unanswered user question
and routes appropriately — to the teacher (literal role match) by
default, or to USER cue when the question is ambiguous or the user
has already complained multiple times.

This addresses the "答非所问" symptom: when agents drift onto adjacent
topics, the director keeps picking peer agents for "variety" rather
than routing back to the teacher to actually answer. The rule is
preventive (applies whenever the question is unanswered, including
the first turn after drift) rather than reactive-only (waiting for
the user to express frustration).

Rule wording uses constrained-output framing: literal `role: teacher`
match (not "best fit" / "highest priority" / "most knowledgeable"),
and an explicit list of forbidden choices (peer agents, END). Earlier
softer wordings let the model pick the assistant when it seemed more
topically relevant; the literal-match wording prevents that.

Validated by the question-answering eval added in the previous commit.
On google:gemini-3-flash-preview, 5 samples per variant across 5
synthesized scenarios:

| Scenario                                           | baseline | with_rule |
|----------------------------------------------------|----------|-----------|
| math_quadratic_axis_drift_no_frustration           | 100%     | 100%      |
| bio_dark_reaction_drift_no_frustration             | 80%      | 100%      |
| english_team_isare_drift_no_frustration            | 20%      | 100%      |
| physics_inertial_mass_escalated_frustration        | 0%       | 100%      |
| calculus_product_rule_drift_no_frustration         | 60%      | 100%      |
| **mean**                                           | **52%**  | **100%**  |

3 of 5 scenarios reproduce the bug at baseline (correct rate ≤ 60%);
the rule lifts every scenario to 100%. The remaining 2 baselines are
already high — the rule doesn't regress them.
Adds an "Answering the User's Question" section to the agent system
prompt. Before this, the agent SP had no rule for direct-answering — the
persona, role guideline, current-state context, and length guidelines all
pushed the agent to advance the lesson and inspire thought, with no
explicit directive to first answer the user's literal question.

The result observed in prod ("答非所问"): when the user asks a specific
question (a formula, a yes/no, a term), agents reply with adjacent
concepts, examples, or follow-up questions — but never the literal
answer.

The new section directs the agent, on every response where the user
message contains a question, to:
- lead with the concrete answer to the literal question,
- not pivot to an adjacent topic even when it seems pedagogically
  richer,
- treat "inspire thought" and peer-differentiation guidance as
  applying only AFTER the literal question is answered,
- say "I don't know" directly when uncertain rather than answering
  a different question,
- on frustration signals from prior turns, look back at the original
  question (before the frustration) and answer THAT specifically.

This is the agent-side counterpart to the director rule landed in the
previous commit. Director routes to the teacher when the question is
unanswered; the teacher now has explicit guidance for what to do when
dispatched.
…ors from completion

Two coupled cleanups in one commit since the outcome-handler rewrite
naturally falls out of removing the client-side max-turns loop.

The director graph used `maxTurns: turnCount + 1` to force "one
director→agent cycle per request" — a hack expressing the single-round
contract as a turn limit. This commit removes the hack and the
surrounding user-facing setting, replacing implicit intent with explicit
graph topology.

Server:
- `director-graph.ts`: drop `maxTurns` from state, drop the hard-cap
  check at the top of `directorNode`, change `agent_generate → director`
  edge to `agent_generate → END`. Single-round contract is now expressed
  by topology.
- `lib/types/chat.ts`: drop `maxTurns` and `currentTurn` from
  `SessionConfig`; drop `maxTurns?` from `CreateSessionRequest.trigger`.

Client:
- `lib/store/settings.ts`: drop `maxTurns` state, `setMaxTurns` action,
  the `'10'` default, and the localStorage migration path.
- `components/agent/agent-bar.tsx`: drop the entire compact-stepper UI
  block plus the unused MessageSquare/Minus/Plus lucide imports.
- `components/settings/agent-settings.tsx`: drop the AgentSettings prop
  and the multi-agent-only Input block.
- `lib/chat/agent-loop.ts`: drop `maxTurns` parameter, change
  `while (turnCount < maxTurns)` to `while (true)` controlled by inner
  exits, drop `'max_turns'` reason from `AgentLoopOutcome`.
- `components/chat/use-chat-sessions.ts`: drop the maxTurns lookup, drop
  the trailing arg to runAgentLoop, drop vestigial `maxTurns: 0` /
  `currentTurn: 0` from session creation.
- `eval/whiteboard-layout/runner.ts`: drop unused MAX_AGENT_TURNS plus
  the trailing arg.

i18n: drop `settings.maxTurns` and `settings.maxTurnsDesc` from six
locales (en-US, zh-CN, zh-TW, ja-JP, ar-SA, ru-RU).

Motivation for removing the user-facing setting: the LLM director
controls round length via cue_user / END decisions, so a numeric
"max discussion turns" knob is redundant with rule-driven routing.
When the director prompt was buggy, the cap masked the bug by force-
ending; once director routing is fixed, the cap just removes user
agency.

Before, every non-`cue_user` outcome of the agent loop mapped to
`status: 'completed'` in `use-chat-sessions.ts`. That conflated four
distinct paths:
- `end`: director chose END (legitimate completion)
- `empty_turns`: two consecutive agents returned no text or actions
  (parser failure, model timeout, etc.)
- `no_done`: SSE stream ended without a `done` event (infrastructure
  failure)
- `aborted`: user pressed stop (handled elsewhere)

The new switch routes `end` to completed, `empty_turns` and `no_done`
through `clearLiveSessionAfterError` so the user sees an inline System
message describing what went wrong instead of a quiet "completed".

- `lib/types/chat.ts`: add `'error'` to `SessionStatus`.
- `components/chat/use-chat-sessions.ts`: rewrite the outcome handler
  as a switch; have `clearLiveSessionAfterError` set `status: 'error'`
  so all entry points (the runAgentLoop outcome AND existing throw
  paths from fetchChat / SSE error events) converge on the same
  visible state.
- `components/chat/session-list.tsx`: add `AlertCircle` red icon for
  the new error status.
- i18n: add `chat.error.emptyAgentResponses` and
  `chat.error.streamInterrupted` strings to all six locales.

This is the in-product fix that the prod investigation surfaced as
an independent root cause of the #511 "discussion ended unexpectedly"
symptom — separate from the director and agent prompt fixes in the
earlier commits of this PR.
@wyuc wyuc force-pushed the feat/598-eval-orchestration branch from ca4ad98 to f3fa658 Compare May 27, 2026 09:33
…nswering script

Two minor CR follow-ups:

- components/chat/use-chat-sessions.ts: the `settingsState` lookup at the
  top of runAgentLoopFn was the entry point for the maxTurns lookup
  ripped in the previous commit; with maxTurns gone, the binding is
  unused and ESLint flags it. Drop the line. (Other uses of
  useSettingsStore.getState() elsewhere in this file are unaffected.)

- package.json: the answering eval runner had no convenience script,
  while the premature-end runner did. Add eval:orchestration:answering
  for symmetry and discoverability.
@wyuc
Copy link
Copy Markdown
Contributor Author

wyuc commented May 27, 2026

CR loop summary (2 rounds, converged)

Ran 2 rounds of code review via the superpowers:requesting-code-review subagent on the full PR diff (origin/main..HEAD).

Round 1 flagged 2 Important issues:

  1. components/chat/use-chat-sessions.ts:460 — dead settingsState binding (orphan from the maxTurns rip), tripping @typescript-eslint/no-unused-vars.
  2. package.json — missing eval:orchestration:answering convenience script (asymmetric with eval:orchestration).

Both fixed in 9a90e84 (4 lines).

Round 2 verified:

  • ESLint clean on the touched file.
  • pnpm exec tsc --noEmit clean.
  • pnpm check (prettier) clean.
  • pnpm check:i18n-keys passes (7 locales, including pt-BR).
  • AgentLoopOutcome.reason union is exhaustively handled by the new outcome switch.
  • maxTurns / currentTurn / max_turns rip is complete; only remaining references are two explanatory comments in director-graph.ts.

Round 2 verdict: loop converged, no new findings.

Out-of-scope items deliberately left for follow-up (called out but not addressed in this PR):

  • Pre-existing directorNode error catch still routes to shouldEnd: true (i.e. completed rather than error) — independent path from this PR's outcome-handler upgrade.
  • AgentSettings component is orphaned on origin/main too; not introduced by this PR.
  • Retry-last-message affordance for the new 'error' session state — separate UX work.

@wyuc wyuc marked this pull request as ready for review May 27, 2026 15:47
@wyuc wyuc requested a review from cosarah May 27, 2026 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Follow-up to #554: remaining items for #511

1 participant